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Abstract 

3D visualization is an important data analysis and knowledge discovery tool, however, interactive visualization of large 3D 
astronomical datasets poses a challenge for many existing data visualization packages. We present a solution to interactively 
visualize larger-than-memory 3D astronomical data cubes by utilizing a heterogeneous cluster of CPUs and GPUs. The system 
partitions the data volume into smaller sub-volumes that are distributed over the rendering workstations. A GPU-based ray casting 
volume rendering is performed to generate images for each sub-volume, which are composited to generate the whole volume output, 
O^l and returned to the user. Datasets including the HI Parkes All Sky Survey (HIPASS - 12 GB) southern sky and the Galactic All Sky 
Survey (GASS - 26 GB) data cubes were used to demonstrate our framework's performance. The framework can render the GASS 
^ data cube with a maximum render time < 0.3 second with 1024 x 1024 pixels output resolution using 3 rendering workstations and 
8 GPUs. Our framework will scale to visualize larger datasets, even of Terabyte order, if proper hardware infrastructure is available. 

Keywords: methods: data analysis, techniques: miscellaneous 
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1. Introduction 

Visualization is the process of generating images of data in 
order to aid knowledge discovery. Visualization is an integral 
part of astronomy, playing a role in all stages of research (plan- 
ning, data monitoring, quality control, analysis, and interpreta- 
tion) and disemmination (publication, presentation, and public 
outreach). 

Three-dimensional (3D) visualization has proven to be of 
great value for studying and interpreting spectral data cubes 
from radiotelescopes (Norris 19941, and more recently from 



optical telescopes fitted with integral field units. A spectral data 
cube is a regular, scalar 3D data lattice. Two axes define (angu- 
lar) sky coordinates, and the third axis defines a (usually linear) 
spectral abcissa. While sky coordinates can ordinarily be trans- 
formed non-degenerately to a physical (spatial) representation, 
the same can only be done for the spectral abcissa under special 
circumstances. Nevertheless, it is standard practice to treat all 
three axes equally and display both axially-aligned 2D slices, 
and arbitrary 3D projections of the cube. 
3D visualization of spectral data cubes can: 

• give improved 3D perception of the data and enhanced 
comprehension of global properties (e.g. Figure [T] of this 
paper); 

• be employed as a quality control tool to detect and inves- 



tigate instrumental and data processing errors, ( Oosterloo 
[T996l|Beeson et al.||2003) ; 



enable innovative quantitative data analysis through the 
selection and characterization of 3D regions and compar- 
isons with simulations [e.g. [Fluke et al.| ( |2010) ]; and 

support the discovery of strange phenomena, unexpected 
relations, or previously unidentified patterns that cannot be 



accomplished with automated techniques (Beeson et al. 
|2003) . 

1.1. Volume rendering 

One particularly useful technique for studying 3D data vol- 
umes is volume rendering. Here, a color-coded 2D projection of 
the 3D data is generated by emulating an optical model that de- 
scribes the interaction of light emitted, absorbed, or reflected by 
elements that make up that volume [e.g. Drebi n et al.| (|T988); 
Lacroute and Levoy| ( |1994| i; |Levoy| ( |1990) ; |Schwarz| ( [2007) ] . 



Volume rendering gives the viewer a global picture of a 3D 
dataset by displaying large and small-scale features, as well 
as internal and external structures. While rendering of isosur- 
faces or manually-segmented surfaces can be used to visualize 
spectral cubes, these techniques frequently fail for two reasons: 
most astronomical sources do not have well-defined surfaces, 
especially in the (typically) low signal-to-noise regime of spec- 
tral line astronomy; and spectral cubes are not directly repre- 
sentative of (or transformable to) a 3D spatial physical repre- 
sentation, and so interpretation of the surface is difficult. In 
contrast to surface rendering, volume rendering remains useful 
where clear feature segmentation cannot be done ( |Beeson et al.[ 
[20031 iGoochl [T995] |Qosterloo| 1 1 995) . 

One of the earliest volume rendering applications in astron- 
omy was in 1992 by Domik and colleagues at the Univer- 
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sity of Colorado (Domik and Mickus-Miceli 1992 Brugel 
1993). They introduced a preliminary implementation, which 
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they called "translucent representation". Despite the limited 
graphics and processing capabilities available at the time, they 
favored volume rendering over the other techniques provided 
in their software suite (such as isosurfaces and data slicing). 
Contemporary astronomy volume rendering implementations 
include those that provide domain-specific transfer functions 



[e.g. the hot gas shader (Oosterloo 1996 1], those that offer 
effective handling of adaptive grids and different data resolu- 



tions ( |Kaehler et aL] |2006[ |Nadeau et al.[|2001||Magnor et al. 
2005), and one that addresses the larger-than-memory data size 
problem ( Beeson et al. 2003| l. 



1.2. The larger-than-memory problem 

The largest spectral line cubes from surveys carried out with 
contemporary radiotelescopes are typically several gigabytes 
(GB) in size. For example, the image cube of the entire south- 
ern sky generated from HIPASS data (forthwith, the "HIPASS 
cube", |Barnes et al.[|200"T] l measures 1721 x 1721 x 1024 vox- 
els|^| and expressed as four-byte floats occupies ~ 12 GB in 
memory or on disk. The image cube of the entire sky gen- 
erated from GASS data ( |McClure-Griffiths et aL] |2Q09| > mea- 
sures 2502 x 2501 x 1093 voxels, and occupies ~ 25 GB. 
The next generation of radiotelescopes [e.g. Australian Square 
Kilometer Array Pathfinder (ASKAPQ LOw Frequency AR- 
ray (LOFARf] Murchison Widefield Array (MWAQ , and 
Karoo Array Telescope (MeerkatQ will produce even larger 
cubes. The standard, single-pointing spectral line cube from 
ASKAP for example is expected to have dimensions of order 
6144 x 6144 x 16384, occupying ~ 2.5 terabytes (TB). 

Volume rendering of these large cubes — from existing and 
planned radiotelescopes — at interactive frame rates [i.e. better 
than ~ 5 frames per second (fps)], is well beyond the capa- 
bility of a single, standalone workstation. The principal limit- 
ing factor is memory capacity, and so we refer to this problem 
as the larger-than-memory data size problem. One "solution" 
to the larger-than-memory problem is to simply extract a sub- 
cube whose data can fit in memory and accept that visualising 
the original spectral cube in its entirety is not possible. While 
there are circumstances where this is an acceptable solution, it 
is sometimes impractical (a cube may need to be visualized in 
10 or more "pieces"), and there can be significant value in vi- 
sualising a large data set in its entirety. 

Consider for example the depiction of the HIPASS cube in 
Figure [T] This volume rendering, accomplished with the tech- 
nique we present in this paper, provides a striking global sum- 
mary of not just the scientific content of the data — the Magel- 
lanic Clouds and Stream (red feature near the centre of the left- 
facing facet of the cube); the residual continuum emission, after 
bandpass calibration, in the plane of the Milky Way Galaxy (the 
green arc-like feature on the left-facing facet); and hundreds of 
galaxies detected in the 21 cm neutral Hydrogen emission line 



1 A voxel is a volume element as a pixel is a picture element. 
2 http ://w w w. atnf .csiro. au/S KA/ 



(short bars running along the spectral axis), which are weakly 
clustered and more numerous nearby (towards the left) — but 
also the numerous artefacts present in the processed data. For 
example, residual continuum emission from the Galaxy extends 
through the entire spectral space (the ramp associated with the 
green arc-like feature), with an intensity variation ("ripple") 
along the spectral axis correlated on the sky (e.g. the increased 
intensity on the upper part of the ramp at the centre of the im- 
age). 

The larger-than-memory problem has been previously exam- 



ined in astronomy by Beeson et al. (|2003 1 who implemented the 
distributed shear-warp algorithm ( |Lacroute and Levoy 1994| l 
over a Beowulf-style cluster. In their solution (dvr — distributed 
volume Tenderer), the volume data was segmented and dis- 
tributed to cluster nodes, rendered locally, and the (sub-)images 
were combined by a controlling node using an associative com- 
positing operator. Dvr demonstrated good rendering speeds 
compared to other solutions at the time, but does not scale 
well to today's largest radioastronomy cubes (see their Fig- 
ure 9). Even if dvr scaled perfectly with no parallelization 
costs, ~ 120 6-core Westmere processor^] with each core han- 
dling 20 Mvox sec 1 , would be needed to render the HIPASS 
cube at 5 fpsQ 

1.3. This work 

In this paper, we present a new solution to the larger-than- 
memory problem, using a significantly smaller computer sys- 
tem than dvr requires for the same input image cube size. Our 
objective is to provide astronomers with a practical tool to inter- 
actively explore and visualize the largest radioastronomy spec- 
tral data cubes in real time. Our initial focus is visualizing data 
from ASKAP, however, it is also applicable to facilities such 
as LOFAR, ALMA^] MWA, and existing large datasets such as 
the HIPASS cube. We begin with the HIPASS and GASS data 
cubes as benchmarks to test our solution performance and scal- 
ability. Both of these datasets are sufficiently large to provide a 
valid test of our volume rendering framework (see results sec- 
tion for more details). 

This work is about removing a potential technological barrier 
and enabling astronomers to have a qualitative look at their data 
as a first step to understanding the complex elements that will 
occur in the largest radioastronomy datacubes. In particular, we 
assert that global views can play a vital role as a quality con- 
trol tool, especially since some of the upcoming facilities (e.g. 
ASKAP) will not be able to keep all the raw observational data 
after the initial processing phase. Being able to see the data 
products from such facilities in a real-time and interactive way 
may save a lot of precious time and data, and provide oppor- 
tunities for reprocessing while the raw data products still exist. 
Furthermore, we anticipate that global views may aid in dis- 
covering systematic and non-systematic noise effects, such as 
signals that vary across the sky due to calibration issues, which 
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Figure 1: Volume rendering of the HIPASS cube, accomplished with the approach described in this paper. The southern sky cube was generated by Russell Jurek 
(ATNF) from 387 individual cubes. See Section 1.2 for a description of the features in this cube. 
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may otherwise take huge processing or data reduction effort to 
determine and extract (e.g. HIPASS data cube in figure[T|. 



2. Ray casting 

A volume rendering is generated by a process called ray 
casting, which computes the projection of a coloured, semi- 
transparent volume onto a (finite) 2D viewing plane. The colour 
and the opacity of each voxel are derived from its data value us- 
ing a predefined mapping operator called a "transfer function". 
For each pixel on the viewing plane, the ray casting process 
computes a ray originating at this pixel and projects it into the 
data volume. The ray is traced through the volume, accumu- 
lating an aggregate colour and opacity which is assigned to the 
pixel. The process of compositing any two voxel colours is 
also defined by the selected transfer function; see |Levoy] (fl990) 
for a detailed description of the original ray casting algorithm. 
Although it is a computationally intensive task, ray casting is 
trivially parallel. This parallel nature has motivated the de- 
velopment of number of parallel ray casting algorithms [e.g. 
Goel and Mukherjee|(|1996|); [Maximo et aL[([2008|l;|Scharsach 



( [2005) ; |J£etaT[ pOTO] )] . 

Graphics Processing Units (GPUs) are the silicon proces- 
sors used to deliver the computations required for 3D computer 
graphics. They represent a cheap, commodity hardware (on 
graphics cards and non-graphics co-processor cards) that can 
now be utilized as massively-parallel, general purpose compu- 
tational co-processors, byusing software development kits such 
as CUDA^f and OpenCL^] GPUs have been taken up rapidly 
by the astronomy computing community [e.g. 



Hamada et al. 



(200^; |^yffietd^p009l >; |Schive et al.|p0T0> ] and in the field 
of astrophysical visualization, GPUs have been well-utilized 
for N-body particle data (|Kaehler et al.| [2006[ |Li et al.[ [2008 



|Szalay et al.| [20081 F"i"et al.| |2010| |Becciani et al.| |2010| >. In 



these examples, GPUs have been used to enhance the rendering 
speed of the graphics primitives by e.g. ~ 23% for the Splotch 
code pin et al.| 2010| l, but these approaches are still limited to 
datasets that fit within a single machine memory. 

Motivated by the appearance of GPUs with large local mem- 
ory (e.g. 1.7 GB in the AMD ATI Radeon 5970; 1.5 GB in the 
NVIDIA GX285; up to 4 GB in the NVIDIA Tesla products), 
and the suitability of the massively-parallel GPU architecture 
to the ray casting algorithm, we developed a framework which 
utilizes a heterogeneous CPU and GPU hardware infrastruc- 
ture, combining shared- and distributed-memory architectures, 
to yield a scalable volume rendering solution, capable of vol- 
ume rendering image cubes larger than a single machine mem- 
ory limit, in real-time and at interactive frame rates. We utilize 
GPUs as massively-parallel floating point co-processors and 
consequently, our framework may be suitable for other parallel 
scientific computing applications, especially those processing 
or analysing larger-than-memory images. 



2.1. Hardware and software architecture 

To address the distributed volume rendering framework, we 
describe both the hardware and software architecture. Figure [2] 
shows a conceptual diagram for our hardware architecture. This 
architecture consists of the following main components: 

1 . A Cluster of interconnected nodes, each featuring one or 
more GPU cards and one or more CPU cores; 

2. A Server machine, which will schedule tasks, exchange 
messages between the cluster and the viewer application, 
for final result composition; 

3. Viewer machine(s), which are a regular workstations as- 
sociated with a display device and I/O mechanism with a 
connection to the server node; and 

4. A File Server, which is a storage node (can be physically 
any of the processing clients or the server machine) acces- 
sible by all the rendering nodes, the server, and the viewer 
machine. 

Based on this architecture configuration, we made the following 
design assumptions: 

• The viewer machine will have only a thin-client applica- 
tion with a small memory and processing requirements. 
Only a small percentage of the processing will be done 
on the viewer machine to facilitate the user interactivity, 
the I/O operations, and the result display. This will make 
the rendering cluster accessible over geographically dis- 
tributed locations; 

• The final output at arbitrary resolution, can be directed to 
a single display or a tiled-display system. Although this 
may decrease the final frame rate, the capability to view 
a dataset at its full resolution, or even with larger resolu- 
tion, is very important. For example, some of the small 
features or details may be hidden if we view ASKAP-size 
data cubes (6144x6144 spatial pixels are anticipated) with 
regular screen resolutions(e.g 1920 x 1200 for WUXGA is 
typical); and 

• All the rendering nodes and the server can communicate 
using message passing interface (MPI) [^j] This can be 
achieved for either a static (the number of cluster nodes is 
constant) or dynamic cluster (the number of cluster nodes 
changes with the problem size). 

The supporting software architecture utilizes the following 
software components: 

• MPI: for the communication between different rendering 
nodes and the rendering server; 

• NVIDIA Compute Unified Device Architecture (CUD A): 
for the ray-casting implementation on the GPU; 

• Multithreading and message queue: for the communica- 
tion and management of different GPUs on the same ren- 
dering node; and 



http://www.nvidia.com/object/ cuda_whatjs.html 
http://www.khronos.org/opencl/ 
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Possible Remote Clients Configuration 



Rendering Cluster 



Figure 2: Conceptual diagram for the system hardware architecture. The main system layers are the viewer application, the rendering server, the data/file server, 
and the rendering client(s). The viewer application may work over different possible remote client's configuration including Web/Thin client, Desktop machine, or 
tiled display system. 



TCP direct socket communication: to communicate be- 
tween the viewer application and the server. 



The components described above were integrated in a C++ 
framework to create a distributed GPU-based rendering farm. 
This C++ framework orchestrates the process of task distribu- 
tion, data mapping, compositing, and communication between 
different GPU kernels, where the actual computation is done. 
This system assumes that the underlying hardware architec- 
ture is heterogeneous, which means that the number of GPUs 
in each rendering node, the amount of available memory, and 
possibly even processing power, is different from one node to 
the next. MPI is used for communication between the dif- 
ferent rendering nodes and the rendering server. It has been 
used because it is the de-facto standard for distributed systems. 
Only the master thread (i.e. the thread responsible for the lo- 
cal scheduling and communication on each rendering node) on 
each processing node can communicate directly with the ren- 
dering server. Also, the server has no direct control over the 
processing threads (the threads associate to the GPU units on 
each node which handle the CUDA kernel invocation and result 
transfer between GPU memory and the node's system mem- 
ory). 

The communication between the master thread and the pro- 
cessing threads on each of the rendering nodes is done using a 
shared Priority Message Queue in a completely asynchronous 
manner. This method speeds up the communication and min- 
imizes the data sharing between the different threads. Also, 



it keeps the CUDA function calls within the same thread 
When required, any Master thread and the server can commu- 
nicate together in a synchronous manner. The communication 
between the server and the viewer application is done through 
a direct TCP socket. This communication channel is opened 
and closed via the client and used to transfer control oriented 
messages and results. The current client implementation uses a 
Qlp^jbased user interface and OpenGLp]for graphical display 
and user interaction. 

2.2. Rendering overview 

The rendering process starts when the user selects a file to 
render (a menu item on the user interface). The client opens the 
requested file and loads the associated metadata. The file di- 
mensions, and the recorded minimum and maximum data val- 
ues are displayed to the user. The user may load the entire cube 
or a manually specified sub-cube. The user also has the con- 
trol to use the minimum and the maximum value recorded in 
the file's metadata or to ask the server to recalculate the cube 
minimum and maximum. Next, the file path and the selected 
cube dimensions are sent to the server to start the data loading 
process. The server performs a global scheduling task that par- 
titions the cube according to the current rendering nodes. The 
server then sends a separate file-loading request to each of the 



12 sharing CUDA context between different thread is not supported as of ver- 
sion 2.3 



1 3 |http ://qt.nokia.com/| 

14 http://www.opengl.org/ 
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processing clients. Each loading request contains the file path 
and the client's associated cube portion. In an asynchronous 
manner, the clients perform a local scheduling task that par- 
titions its associated cube portion, depending on the available 
GPUs in each client, and start loading data. Once the data is 
loaded, the data is transferred from the system memory to the 
GPU memory. 

If requested by the user, each GPU calculates the minimum 
and the maximum of its cube portion, which are then sent back 
to the server where the global minimum and maximum is cal- 
culated. The global minimum and maximum are returned to the 
client with a data loading confirmation message. The viewer 
application then generates the color map, which will be used 
in the rendering, and sends it back to the server as well. After 
these initialization steps are complete, the server will stay pend- 
ing for render requests from the viewer application. Whenever 
the user interacts with the displayed output on the viewer appli- 
cation, the rendering parameters (the transformation and pro- 
jection matrices) are sent to the server associated with a render 
request. This render request is distributed over the rendering 
nodes, which use the GPUs to generate the frame portions and 
combine (i.e. composite) them. The server performs the fi- 
nal frame composition and sends the result back to the client. 
The user is also able to perform some other interactions, such 
as changing the color map, enabling/ disabling spectral chan- 
nels), which are handled in the same manner. 

2.3. Data Partitioning 

The dataset is partitioned in an object-based manner, which 
means that each GPU gets only a portion of the data. At the 
same time each GPU is responsible for generating a portion of 
the final frame, corresponding to the projection of its assigned 
data on the current viewing plane using the ray-casting algo- 
rithm (Sabella 1988}. The final resultant frame images from 
each GPU are composited together to produce the final output 
frame. Figure [3] shows an example data partitioning mecha- 
nism. Alternative data partitioning mechanisms could be ap- 
plied, including more sophisticated techniques like binary space 



partitioning (Thibault and Naylor 1987! or an Octree (Samet 
[1995). 

The data partitioning mechanism affects the final rendering 
time and the overall performance as we show in Section 3. 
Based on the data partitioning schema and the viewing angle, 
the size of the rendering task differs from one GPU to another. 
Keeping the size of these tasks balanced is an important fac- 
tor to achieve the highest possible frame rate and interactivity 
level. We intend to leave further investigation and enhancement 
to the scheduler as future work. In the current implementation, 
each scheduling module performs a separate data partitioning 
decision based on the current longest axis, as shown in Figure 
[3] so as to achieve a "fair" distribution. 

2.4. Ray-Casting Process 

The processing threads are responsible for loading the data 
portion associated to its assigned GPU and transferring it to 
the GPU local memory. The GPU memory in our case is the 



most precious resource, because of the high time cost of data 
transfer between the GPU memory and the CPU memory, and 
the limitation of the GPU memory size (will reach 6 GB in the 
next Fermi GPLp^l, but is limited to 4 GB with the current Tesla 
cards). 

Each processing thread computes a "rendering rectangle". 
This rendering rectangle represents the region that the associ- 
ated GPU is responsible to fill with suitable colors in the final 
frame. This process is performed using the convex hull algo- 
rithm (Graham 1972| l; the projection and transformation matri- 
ces; and previous knowledge about the extent of the data cube 
stored in the GPU's memory. Each ray within the rendering 
rectangle is tested against the bounding cube for the assigned 
data portion. In the case where the ray does not intersect with 
the cube, the current ray sampling process will exit. If the ray 
intersects with the cube, the entry and exit points are calculated 
and the sampling distance is determined based on the length of 
the ray segment, which will be inside the bounding cube. The 
ray casting kernel employs an early ray-termination mechanism 



(Levoy 1990) to speed up the overall rendering process. This 
mechamism terminates the ray casting process when the accu- 
mulated pixel value reach its maximum value. The final output 
of the ray-casting process is a 2D floating point array of max- 
imum recorded scalar value (in the case of maximum intensity 
projectiorp'} for each pixel. 

The usage of these optimization steps minimizes the num- 
ber of threads needed and the amount of output buffer mem- 
ory accessed by each GPU, thus speeding up the execution and 
memory transfer time. We note that these optimization steps do 
not affect the final output resolution or image quality. Some of 
these steps are also used to optimize the final frame composi- 
tion steps, which we now describe. 

2.5. Image Composition 

The image composition is performed in two main stages (see 
Figure [4]): 

1. Local stage at every rendering node. In this stage, each 
node is compositing the results generated by its GPUs to 
generate a single buffer. This composition is done with the 
guidance of the rendering rectangle for each GPU. Each 
GPU is executing a composition process only within its 
rendering rectangle on the final local frame buffer in a se- 
quential manner; and 

2. Global stage at the server node. In this stage the server 
node is compositing the results of all the rendering nodes 
to generate the final rendering buffer. 

The use of this two stage process speeds-up the composi- 
tion process and minimizes the processing effort required by 
the server and hence the final rendering time. The final compo- 
sition complexity and validity depends on the selected transfer 
function. Lombeyda et al. (2001 ) provide a mathematical proof 
that the general alpha-blending volume rendering operator is 
associative, and can be applied in any blending order. 



15 http://www.nvidia.com/object/ fermi_architecture.html 
16 http://support. svi.nl/ wiki/MaximumlntensityProjection 
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Node (3) Node (2) Nodefl) 

(2 GPU) (2 GPU) (1 GPU) 




Figure 3: Illustration of the data partitioning process over a cluster configuration of three processing clients. The first and the second processing client have two 
GPUs, while the third one contains one GPU. The cube partitioning is done based on the longest axis, so in this example the input data cube is partitioned into 3 
parts over the X axis. For node 2 and node 3, the longest axis for their assigned cube is the Y axis. Node 1 has only one GPU so no further partition is required. 





Node 1- GPU(2) Output Node 1- GPU(1) Output 




Node 2- Output 



Node 2- GPU(2) Output 



Node 2- GPU(1) Output 



Figure 4: Illustration of the image composition process over a cluster configuration of two processing clients each with two GPUs. A first compositing step occurs 
on each node, and the final image is combined on the server. 
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Figure 5: Volume rendering of the GASS cube, accomplished with the approach described in this paper. 




3. Results 

A cluster of four interconnected workstations (the fourth al- 
ways acting as a server) and nine GPUs (one associated with the 
server) was used to conduct framework timing tests. The hard- 
ware specifications of these workstations are shown in Table [T] 
Details of the six individual hardware configurations used are 
shown in Table [2] Tests were performed using a 1024 x 1024 
pixel viewport and a pre-computed sampling distance assuring 
that each voxel value (intersected by at least one ray) is sampled 
at least once. The communication is performed over a giga- 
bit Ethernet network. Table [3] shows the details of the datasets 
used for timing tests. Sample volume rendering are shown for 
HIPASS cube (Figure [TJ, GASS cube (Figure |5), and a cube 
generated from a cosmological N-body simulation (Figure [6] 
refer Table [5J. Although the latter dataset is not from radio as- 
tronomy, it demonstrates the applicability of our framework to 
other 3D data volumes. 

We measure the total frame rendering time, Tr, which is the 
elapsed time between issuing a rendering request and receiving 
the rendered frame back. Tr is an indication of the number of 
frames per second that the framework can render, and is a sum- 
mation of the time spent on different framework sub-processes. 



T R = Max(T ray(i) ) + Max(T merge(j) ) + 
+ Max(T cornm(j) ) + T server ± e 



(1) 



Where 1 < i < Number of GPUs, 1 < j < 
Number of Workstations, T my indicates the time spent on the 
GPU doing ray tracing, T meTge is the time spent on each client 
for local merging, T comm is the communication time between 
each workstation and the server, and r selver is the time spent by 
the server doing the global merging. The last component, e, in- 
dicates that variation happens due to the overlapping effect and 
any unexpected delays (see below). 

Due to the framework's distributed behaviour and the het- 
erogeneous hardware, the maximum time spent in each sub- 
process is dominated by the time spent by the slowest process- 
ing element. For example, the ray casting process times vary 
between different GPUs based on the cube's orientation, which 
effects the size of the rendering rectangle of the dataset por- 
tion. This affects the number of rays that need to be traced and 
the path-length of the rays through the volume, which in turn 
affects the number of sampling operations required for each 
ray. Overlaps between communication and computation usu- 
ally eliminate the differences between processing elements, but 
due to the randomness of this overlapping order, its effect is not 
constant. 

Figure [7] shows a sample timing diagram for the HIPASS 
cube as a function of cube orientation, using the 2P4G config- 
uration (see Table [2]). The rendering time depends on the cube 
orientation, which is represented by the rotation angle, 9, about 
the y-axis in degrees. T ray is the dominating factor for the vari- 
ation in the rendering time with an average near 50%. Also, a 
small variation between the frame rendering times for opposite 
angles, (0 and 180° + ff), is caused by the early ray termination 



optimization step in the ray tracing implementation. The value 
of Max(T merge ) , Max(T comm ), and Max(T servel ) are almost con- 
stant, because they are directly proportional to the final frame 
output size, which is not affected by the cube orientation. 

Figure [8] shows the average, minimum, and maximum frame 
rendering time for the HIPASS cube for different cluster config- 
urations. Figure|9]shows the average, minimum, and maximum 
frame rendering time for the three datasets using the 3P8G con- 
figuration. Based on these timing tests, we can conclude that: 

1. Increasing the number of GPUs does not necessarily re- 
duce the final rendering time. Due to the communication 
and compositing overheads, for a fixed number of GPUs, 
the lower the number of workstations, the lower the frame 
rendering time. 

2. The unbalanced distribution of rendering tasks over the 
GPUs limits the framework speedup. Although different 
rendering tasks are being performed in parallel, the total 
frame rendering time is dominated by the maximum GPU 
rendering time as shown in equation 1 . The current sched- 
uler implementation uses the dataset dimensions and the 
GPU computational power (number of cores and memory 
size) to fairly partition the dataset over the rendering nodes 
and GPUs. But the problem size each GPU tries to solve 
varies based on the dataset characteristics (affects early ray 
termination), and the traced rays' length (because of per- 



spective projection). Figures 10 and 11 demonstrate this 
by showing the different sub-frame rendering time for each 
GPU for the HIPASS cube using the 2P4G configuration. 
Based on the variation in the difference between the max- 
imum, the average, and the minimum rendering time, we 
expect this "unbalanced distribution" influence to disap- 
pear with a large increase in the number of GPUs. On the 
other hand, the average total frame rendering time with- 
out early ray termination and with orthographic projection 
is higher by 12 % from the average total frame rendering 
time with the early ray termination and perspective projec- 
tion. The usage of a better data partitioning and scheduling 
which depends on the dataset characteristics and dimen- 
sions may also improve this behaviour. 



Figure 12 shows timing for the HIPASS data cube with two 



different output sizes 1024 x 1024 and 512 x 512 pixels. The 
timing follows the same pattern for both output sizes but with an 
average time reduction of 69%. This reduction in the frame ren- 
dering time is due to the reduction in T my , the reduction in the 
size of merging operations, and the size of the data exchanged 
between the rendering workstations and the server. 



4. Discussion 

4.1. Framework performance and scalability 

In order to evaluate the performance gain from adopting 
GPUs as the main processing elements, we compared our per- 
formance timing with the distributed volume rendering imple- 
mentation (dvr) introduced by Beeson et al. ( 2003| l. The main 
performance evaluation done for dvr used 2 GHz Pentium 4 
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Table 1 : Hardware specification for the cluster used to evaluate the performance of our framework. 



Index 


GPU Model 




GPU memory 


Processor Model 


System memory 


1 


4x NVIDIA Tesla CI 060 


4 Gigabyte/GPU 


16x Intel Xeon X5550 


18 Gigabyte 


2 


2x NVIDIA Tesla C106C 


17 


4 Gigabyte/GPU 


2x Nehalem i7 


24 Gigabyte 


3 


4x NVIDIA Tesla CI 060 


4 Gigabyte/GPU 


2x Nehalem i7 


24 Gigabyte 


4 


lx GeForce GTX 285 




1 Gigabyte 


lx Intel i7 930 


12 Gigabyte 



Table 3: Sample datasets used to evalute the performance of our framework. 



Dataset Name 


Dimensions (Data Points) 


Source / Credits 


File Size 


Nbody cube 


1024 x 1024 x 1024 


High resolution 1080 J dark matter simulation of 
a 125 Mpc/h box by Swinburne Computations for 
WiggleZ (SCWiggleZ) project (Poole et al 2010, 


4 Gigabyte 






in prep) 






HIPASS Cube 


1721 x 1721 x 1025 


HIPASS Southern Sky, data courtesy Russell Ju- 
rek/HIPASS team 


12 Gigabyte 


GASS Cube 


2502 x 2501 x 1093 


The Parkes Galactic All-Sky Survey, data courtesy 
Naomi McClure-Griffiths/ GASS team (McClure-| 


26 Gigabyte 






Griffiths et al. 


2009) 








30 60 90 120 150 180 210 240 270 300 330 360 

B=Data cube rotation angle (degrees) 

Figure 7: Single frame rendering times for different cube rotation angles. The timing measurements were done for the HIPASS Southern Sky data cube on 3 
workstations (one acting as a server) and 4 GPUs (2 P 4 G). 
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» Min Average » Maximum 
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1P4G 2P4G 2P5G 3P5G 3P6G 3P7G 3P8G 

Configuration 

Figure 8: Minimum, average, and maximum rendering time for the HfPASS data cube using different hardware configurations. The horizontal line shows our target 
of 5 fps for real-time interaction. 
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Figure 9: Minimum, average, and maximum rendering time for the HIPASS cube, GASS Cube, and Nbody cube using the 3P8G configuration. 
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Figure 10: Different GPU sub-frame rendering time for the HIPASS cube using the 2P2G configuration with early ray termination deactivated and orthographic 
projection. 
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G=Data cube rotation angle (degrees) 

Figure 1 1 : Different GPU sub-frame rendering time for the HIPASS cube using the 2P2G configuration with early ray termination enabled and perspective projection. 
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9=Data cube rotation angle (degrees) 

Figure 12: Single frame rendering times for different cube rotation angles with different output sizes ( 1024 X 1024 and 512 X 512 pixels). The timing measurement 
was done for the HIPASS cube on 3 workstations (one acting as a server) and 4 GPUs (2 P 4 G). 



Table 2: Details of the hardware configurations of the cluster (GPU+CPU) used 
to evaluate performance. The letter 'P' is used as an abbreviation for the number 
of processing workstation, and the letter 'G' is used as an abbreviation for the 
number of GPUs. The configurations do not include the server. 



Configuration 


Client(l) 


Client(2) 


Client(3) 


Server 


1 P4G 


4 






1 


2P4G 


2 


2 




1 


3P5G 


2 


2 


1 


1 


3P6G 


2 


2 


2 


1 


3P7G 


2 


2 


3 


1 


3P8G 


2 


2 


4 


1 



CPUs and was able to render around 7 MVox/s. If we scale that 
to current CPU clock speeds (e.g. Westmere 2.93GHz proces- 
sor), dvr should be capable to render 120 MVox/s with 6 cores. 
Consequently, using equation (7) of Beeson et al. ( 2003) and 
by assuming perfect scalability, dvr needs approximetely 120 
x 6-core Westmere processors in order to render the HIPASS 
cube with 5 fps, while our framework can render that cube with 
4 GPUs. 

During our performance and timing tests, we were unable 
to obtain a data file larger than the GASS data cube (26 GB). 
Furthermore, access to a larger GPU cluster was not available. 
Although we remain cautious about conclusions related to scal- 
ability of our framework, we believe it should be able to han- 
dle larger datasets, even of Terabyte order, if proper hardware 
infrastructure is available. We base this conclusion on the fol- 
lowing: 



the amount of data transfer between the rendering work- 
station and the server is almost constant and depends on 



the output resolution rather than the input data size; 

• the usage of ray casting combined with the rendering rect- 
angle optimization makes the problem size almost con- 
stant for the rendering and the communication between the 
nodes; 

• the overlapping between the computation and communi- 
cation minimizes the communication delay; 

• the two stage frame composition minimizes the amount of 
the processing required by the server and the amount of 
data exchanged between the clients and the server; and 

• the local and global composition processes are done on 
GPU with a negligible cost compared to the rendering 
time. 

Moreover, the on-going increase in number of processing 
cores and size of local memory of GPUs will help to decrease 
the rendering time and minimize the number of GPUs needed 
to render a certain data size. For example the next generation 
of NVIDIA Tesla GPUs is expected to have double the current 
number of processing elements and 1 .5 times the current mem- 
ory size. 

4.2. Future Work 

We expect the performance of our framework will be dramat- 
ically enhanced by using the next NVIDIA GPU architecture, 
code-named Fermi. Features like predicated instructions, larger 
local memory, larger memory address space, greater DRAM 
bandwidth, improved instruction scheduler, and higher FLOP/S 
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will provide our system with more powerful hardware infras- 
tructure and remove some of the software bottlenecks. Also, 
utilizing the direct integration between the Tesla cards and an 
InfiniBand network infrastructure may decrease the communi- 
cation overhead and provide improved scalability^] 

To overcome the current GPU memory size limitation, we 
believe that lossy compression of the cube portions stored in 
the GPU local memory, using wavelet compression, will en- 
able us to store and render larger cubes without affecting the 
final render quality. There have been a few trials to visualize 



datasets stored in the compressed wavelet format ( Nguyen and 



Saupe 2001 1 Guthe et al. 2002 1, but to our knowledge none 
of these algorithms have been ported to GPU yet. It is likely 
that wavelet techniques can also be applied as a noise removal 
tool to increase the output quality. The usage of wavelet com- 
pression was demonstrated by |Pence et"aT] ( |2009| l with optical 
data, however, some modifications are required to enable fast 



decompression and retrieval of random data positions (Nguyen 



and Saupe 200TJ. 

Quantitative data visualization support is still a missing in- 
gredient from a complete visualization and analysis system for 
astronomy, and may be the main factor limiting a wider adop- 
tion of 3D visualization in astronomy. We aim to examine this 
further in future work. A closely related issue is the use of 
noise-suppression techniques, as faint signals may be hidden in 
large-scale noise features. Designing one or more specialized 
transfer functions can provide the user with better visualization 
results. Most of the current well known transfer functions may 
not provide the user with the best visualization output because 
they were designed to serve other scientific domains. Transfer 
functions capable of suppressing noise and emphasising impor- 
tant data features will provide the users with better visualization 
outcomes and enhance the usefulness of 3D visualization as a 
data analysis and knowledge discovery tool. 

5. Conclusion 

Visualization is a valuable tool for knowledge discovery. 
Along with providing insight and opportunities for analysis of 
sources under investigation, global views of data are vital for 
the detection of instrumentation errors, and the identification 
of data artefacts and noise characteristics. New approaches 
are needed to visualize the massive, Terabyte order, data cubes 
that will be produced routinely by facilities such as ASKAP, 
MeerKat, LOFAR and ultimately, the Square Kilometre Array. 

In this work, we have introduced a framework to visualize 
larger than memory multispectral 3D datasets. The framework 
provides the user with a real-time interactive volume rendering 
by combining between shared and distributed memory architec- 
tures, employing a distributed GPU infrastructure, and using the 
ray-casting volume rendering algorithm. We are trying to pro- 
vide astronomers with a more affordable solution to deal with 
the upcoming data avalanche, by offering GPUs as an alterna- 
tive to the currently used distributed computing infrastructures. 



8 http://www.nvidia.com/object/ io_1258539409179.html 



By reducing the number of machines required to handle such 
datasets we not only reduce the overall hardware cost but also 
we provide an easier to deploy, and hence manage, solution. A 
remote viewer application is used to enable the user to control 
and interact with the framework. System implementation was 
done using QT, MPI, and CUD A within a C++ object oriented 
framework. 

Framework performance was evaluated using a cluster of 
four workstations and nine GPUs. The performance evalua- 
tion and timing tests were used to show the framework scala- 
bility, how different framework processes contribute to the final 
rendering time, and the effect of changing the cube orientation 
and the output viewport size on the rendering time. Medium 
size (12 GB) and relatively larger data cube (26 GB) were used 
throughout the timing tests. The maximum total rendering time 
for a 26 GB data cube with a 1024 x 1024 output viewport was 
< 0.3 second, with frame rates of 5 fps achievable. 

Based on the framework performance and timing analyses, 
we believe it should be able to visualize larger datasets, even of 
Terabyte order, if proper hardware infrastructure is available. 
The usage of ray casting, the overlapping between communi- 
cation and computation, and the two stage results compositing 
minimize the parallelization overhead and the final frame ren- 
dering time. 
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